    The Architecture of the XtreemOS Grid Checkpointing Service

    The EU-funded XtreemOS project implements a grid operating system (OS) that transparently exploits distributed resources through the SAGA and POSIX interfaces. XtreemOS uses an integrated grid checkpointing service (XtreemGCP) to implement migration and fault tolerance. Checkpointing and restarting applications in a grid requires saving and restoring applications in a distributed heterogeneous environment. Such an environment may span millions of grid nodes, each using a system-specific checkpointer to save and restore application and kernel data structures on that node. In this paper we present the architecture of the XtreemGCP service, which integrates existing checkpointing solutions. Our architecture is open to different checkpointing strategies that can be adapted to evolving failure situations or changing application requirements. We propose to bridge the gap between grid semantics and system-specific checkpointers by introducing a common kernel checkpointer API that allows different checkpointers to be used in a uniform way. Furthermore, we discuss other grid-related checkpointing issues, including resource conflicts during restart, security, and checkpoint file management. Although this paper presents a solution within the XtreemOS context, it can also be applied to any other grid middleware or distributed OS.
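The idea of a common checkpointer API can be illustrated with a minimal sketch. Everything named here (`Checkpointer`, `InMemoryCheckpointer`, `checkpoint_job`, the node and backend names) is hypothetical and chosen for illustration only; it is not the XtreemGCP interface, and the stand-in backend merely records requests rather than saving any process state.

```python
from abc import ABC, abstractmethod


class Checkpointer(ABC):
    """Hypothetical uniform interface; NOT the actual XtreemGCP API."""

    @abstractmethod
    def checkpoint(self, job_id: str) -> str:
        """Save the job's state and return a checkpoint identifier."""

    @abstractmethod
    def restart(self, checkpoint_id: str) -> str:
        """Restore a job from a checkpoint and return its job id."""


class InMemoryCheckpointer(Checkpointer):
    """Stand-in backend (assumption): it only records what it was asked to do."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._saved: dict[str, str] = {}  # checkpoint id -> job id

    def checkpoint(self, job_id: str) -> str:
        checkpoint_id = f"{self.name}:{job_id}:{len(self._saved)}"
        self._saved[checkpoint_id] = job_id
        return checkpoint_id

    def restart(self, checkpoint_id: str) -> str:
        return self._saved[checkpoint_id]


def checkpoint_job(job_id: str, nodes: dict[str, Checkpointer]) -> dict[str, str]:
    """Checkpoint one job on every node through the same interface."""
    return {node: cp.checkpoint(job_id) for node, cp in nodes.items()}


# Two nodes with different (hypothetical) backends, driven identically.
nodes = {
    "pc-1": InMemoryCheckpointer("backend-a"),
    "ssi-cluster": InMemoryCheckpointer("backend-b"),
}
images = checkpoint_job("job-42", nodes)
restored = {node: nodes[node].restart(cp_id) for node, cp_id in images.items()}
```

The caller never branches on the backend type, which is the property the abstract attributes to a uniform checkpointer interface; a real adapter would wrap a system-specific checkpointer behind the same two methods.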

    Independent Checkpointing in a Heterogeneous Grid Environment

    The EU-funded XtreemOS project implements an open-source grid operating system based on Linux. To provide fault tolerance and migration for grid applications, it integrates a distributed grid-checkpointing service called XtreemGCP. This service is designed to support different checkpointing protocols and to address the underlying grid-node checkpointers (e.g. BLCR, LinuxSSI, OpenVZ) transparently through a uniform interface. In this paper, we present the integration of an independent checkpointing and rollback-recovery protocol into XtreemGCP. The solution we propose is not bound to a particular checkpointer and can therefore be used transparently on top of any grid-node checkpointer. To evaluate the prototype, we run it in a heterogeneous environment composed of single-PC nodes and a Single System Image (SSI) cluster. The experimental results demonstrate the capability of the XtreemGCP service to integrate different checkpointing protocols and to independently checkpoint a distributed application within a heterogeneous grid environment. Moreover, the performance evaluation shows that our solution outperforms the existing coordinated checkpointing protocol in terms of scalability.

    Checkpointing Process Groups in a Grid Environment

    The EU-funded XtreemOS project implements a grid operating system that transparently exploits the resources of virtual organizations through the standard POSIX interface. Grid checkpointing and restart requires saving and restoring jobs executing in a distributed heterogeneous grid environment. Such an environment may span millions of grid nodes (PCs, clusters, and mobile devices), each using a system-specific checkpointer to save and restore application and kernel data structures for the processes executing on that node. In this paper we briefly describe the XtreemOS grid checkpointing architecture and how we bridge the gap between the abstract grid and the system-specific checkpointers. We then discuss how we keep track of processes and how different process grouping techniques are managed to ensure that all processes of a job, and any further dependent ones, can be checkpointed and restarted. Finally, we present how Linux control groups can be used to address resource isolation issues during restart.
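The requirement that all processes of a job and any dependent ones be captured can be illustrated with a small sketch. The abstract does not say how XtreemOS tracks processes, so this assumes a plain parent-to-children map and a breadth-first traversal; the function name `processes_to_checkpoint` and the process ids are hypothetical.

```python
from collections import deque


def processes_to_checkpoint(roots, children):
    """Return every process reachable from the job's root processes.

    `children` maps a process id to the ids of the processes it created
    (an assumed representation, not the XtreemOS data structure).
    """
    seen = set(roots)
    queue = deque(roots)
    while queue:
        pid = queue.popleft()
        for child in children.get(pid, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Process 100 is the job's root; 200 and 201 belong to an unrelated job.
children = {100: [101, 102], 102: [103], 200: [201]}
group = processes_to_checkpoint([100], children)
```

Here `group` contains the root, its children, and its grandchild, while the unrelated processes are left out. A real service would also have to follow dependencies that are not parent-child relations, which this sketch does not attempt.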

    Literatur
